3 Explanatory modeling vs Predictive modeling
Before diving into the actual analysis, it is important to give a theoretical introduction to how the choice of the data analysis methodology came about. Therefore, in this chapter we will describe the context in which Qualitative Comparative Analysis arose and why it is well suited to unpacking an ESG rating.
3.1 The black box
Statistical modeling has long followed two approaches to reaching conclusions from data: on one side, one assumes that the data are drawn and generated by a data model that must be identified; on the other, one builds and uses algorithmic models from the available data without exploring the underlying model, which therefore remains unknown. The statistical community has historically been constant in favoring almost exclusively the first approach: data models. This exclusive dependence on one culture, says Leo Breiman in “Statistical Modeling: The Two Cultures” (2001), has restrained the application of statistics in many new fields where it is currently difficult to explore the underlying data model. Breiman’s point is that if statisticians use data to solve problems, then they should distance themselves from such a strict dependence on data models and openly adopt the more diverse set of tools now available through the algorithmic modeling culture.
To clarify the difference between the data modeling culture and the algorithmic modeling one, we should focus on their distinct attitudes towards the so-called “black box”.
Firstly, we should consider that statistics is based on data and that data are generated by a black box (which Breiman’s original scheme refers to as “nature”) into which a set of variables x (independent variables) enters and out of which the variables y (response variables) come. Within the black box, nature associates the x variables with the y variables.
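A minimal rendering of that scheme is the following (the notation is ours and purely illustrative of Breiman’s diagram):

```latex
\mathbf{x} \;\longrightarrow\; \boxed{\text{nature}} \;\longrightarrow\; y
```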

At this point, there are two main objectives in studying the phenomenon, and both are pursued by analyzing the data:

* Prediction: being able to predict the responses to future or hypothetical input variables;
* Information: being able to understand how nature associates the input variables x with the response variables y.
These two objectives are approached differently by the data modeling culture and the algorithmic modeling one.
3.1.1 The Data Modeling approach
The main assumption of the data modeling culture is that the black box should be studied and explored in order to formulate a stochastic data model of what happens within it. A classic data model assumes that the data are generated from a set of predictor variables, random noise, and parameters that are estimated from empirical evidence. The model thus defined is then used for both the prediction and the information goals.
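To make this concrete, a classic instance of such a stochastic data model is shown below; the linear form is only one illustration among many possible parametric specifications:

```latex
y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```

where the parameters (the betas and sigma squared) are estimated from the observed data. Prediction then amounts to plugging new values of x into the fitted equation, while information is obtained by interpreting the estimated coefficients.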
3.1.2 The Algorithmic Modeling approach
The starting point of the algorithmic modeling culture, instead, is the acceptance of the black box as an unknown operating process. The focus therefore shifts towards identifying a function of x that best predicts y, which gives the resulting model great strength and accuracy in prediction.
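A minimal sketch of this approach is given below, assuming Python with scikit-learn and a purely synthetic dataset (all names and figures are illustrative and not drawn from the analysis in this thesis): a random forest, Breiman’s own algorithm, learns a function of x that predicts y and is judged solely on out-of-sample predictive accuracy, without any stochastic model of nature ever being specified.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic, purely illustrative data: the mechanism linking x to y is
# treated as an unknown black box and is never modeled explicitly.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 5))                     # input variables x
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

# Hold out part of the data so the model is judged only on prediction.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Algorithmic model: learn a function f(x) that predicts y well, without
# estimating the parameters of a presumed data-generating model.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

print("Test MSE:", mean_squared_error(y_test, forest.predict(X_test)))
```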
3.2 To explain or to predict
Breiman, professor at the University of California, Berkeley, is extremely critical of the data modeling approach and uses his experience as a consultant and his knowledge of algorithmic modeling to make his point, especially for all those fields where it is obvious that data models are not applicable or are too complex to be viable: speech and handwriting recognition, or non-linear time series and financial market prediction. His strongest leverage lies in projects commissioned by the Environmental Protection Agency in which he took part, where the algorithmic modeling approach provided crucial and substantial help when a swift and effective response was needed. He states that “With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis.” (Breiman, 2001)
Breiman’s paper stirred up mixed feelings among the statistical community, and many experts promptly reacted by publishing comments on it. One of them is David R. Cox, a leading exponent and supporter of the data modeling approach. He argues that even if one ignores design aspects and operates directly on data without stating any scientific hypothesis, crucial concerns would still arise: what the data mean, what biases the sample may carry, and what distortions may result from incomplete data or underlying dependencies. The path Cox points out could definitely be considered a third way between the data modeling and the algorithmic modeling culture, less drastic than the stance taken by Breiman. Cox serenely acknowledges that the choice of path is strongly dependent on the context and that there are situations where a predictive and practical approach turns out to be preferable (e.g. for short-term forecasting). Still, when the need for immediate feedback fades and tasks get more complex, prediction without substantial exploration of the underlying processes that associate phenomena generally becomes less and less accurate. Cox ends by leaving us with a rather thought-provoking question: “Professor Breiman takes a rather defeatist attitude toward attempts to formulate underlying processes; is this not to reject the base of much scientific progress?” (Cox, 2001)
Joining Cox’s comment on Breiman’s paper, Professor Efron takes a similar stance towards it. He presents a few points that might help contextualize and temper the radical rejection put forward by Breiman:

- new methods always seem to unhinge old ones, and holistic comparisons between approaches are difficult to make without bias or well-defined contexts;
- complicated methods are harder to criticize than simple ones: the limitations of long-established data models are easy to point out precisely because they are simpler and better known to date;
- the algorithmic modeling culture is in fact more widespread within the statistical community than Breiman feels it is. (Efron, 2001)
Efron, like Cox, strongly believes that the ultimate and highest goal of science is to crack open the black boxes and understand the natural association processes that happen within them, in order to interact with phenomena for the benefit of mankind. Still, it cannot be said that explanatory models without predictive power, or conversely predictive models without explanatory power, are categorically useless in scientific terms. In his 2010 paper “To Explain or to Predict?”, Galit Shmueli gives the example of an explanatory model that cannot be tested in terms of predictive accuracy and yet has been a fundamental step forward in modern science: the Darwinian theory of evolution. Conversely, a predictive model that does not have great explanatory power yet is scientifically significant is Galileo’s demonstration of the instantaneousness of light. (Shmueli, 2010)
In an ideal overall view, the data modeling approach and the algorithmic one should be applied in a complementary way: the theory drawn from the study of the black box guides the operational choices made in constructing the algorithm, so as to produce accurate and theoretically supported forecasts.